
[Kubernetes leader election] Run leader elector at all times #4542

Merged: 18 commits merged into elastic:main on Apr 22, 2024

Conversation

@constanca-m (Contributor) commented Apr 8, 2024

What does this PR do?

This PR covers the issue elastic/beats#38543.

The fix was already merged for beats: elastic/beats#38471.

Now we need to align the agent with the same logic as well.

We do that by making sure the leader elector is always running. Previously, when an instance lost the lease, the leader elector would finish running, and it would not start again.
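The change boils down to wrapping the client-go LeaderElector's Run call in a loop so that election restarts after the lease is lost. A minimal sketch of the idea follows; the helper name and the ctx.Err() check are illustrative, not necessarily the exact provider code:

import (
	"context"

	"k8s.io/client-go/tools/leaderelection"
)

// runForever is a hypothetical helper illustrating the fix: restart the leader
// election loop whenever the lease is lost, and stop only when the provider
// context is cancelled.
func runForever(ctx context.Context, le *leaderelection.LeaderElector) error {
	for {
		// Run blocks until the lease is lost or ctx is cancelled.
		le.Run(ctx)
		if err := ctx.Err(); err != nil {
			// The provider is shutting down.
			return err
		}
		// The lease was lost: loop and try to re-acquire it. client-go's
		// acquire loop is already paced by RetryPeriod, so this does not
		// flood the API server.
	}
}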

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Related issues

  • elastic/beats#38543

How to test

  1. Build elastic-agent:
EXTERNAL=true SNAPSHOT=true DEV=true PLATFORMS=linux/amd64 PACKAGES=docker mage package
  2. Create the stack and a kind cluster, and connect the cluster to the stack:
elastic-package stack up --version=8.14.0-SNAPSHOT
kind create cluster
docker network connect elastic-package-stack_default kind-control-plane
  3. Build the docker image:
cd build/package/elastic-agent/elastic-agent-linux-amd64.docker/docker-build
docker build -t custom-agent-image .
  4. Load this image into your kind cluster:
kind load docker-image custom-agent-image:latest
  5. Deploy the agent with that image by setting it in the agent manifest:
containers:
  - name: elastic-agent
    image: custom-agent-image:latest
    imagePullPolicy: Never

Results

The following test was done to make sure the leader elector runs again after losing the lease:

  1. Create a 2-node cluster, so there are two agents running.

  2. Change the lease, so that the first leader loses it.
    First leader:

c@c:~$ kubectl get leases -n kube-system | grep elastic-*
elastic-agent-cluster-leader           elastic-agent-leader-elastic-agent-standalone-kmx9m                         9m24s
Change the lease holder, for example by editing the Lease object (kubectl edit lease elastic-agent-cluster-leader -n kube-system) and setting a different holderIdentity:
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: elastic-agent-cluster-leader
  namespace: kube-system
spec:
  holderIdentity: change

After a while, check the new leader:

c@c:~$ kubectl get leases -n kube-system | grep elastic-*
elastic-agent-cluster-leader           elastic-agent-leader-elastic-agent-standalone-67x8n                         11m
  3. Change the lease again, and make sure the previous leader becomes the holder once more. This way we know that the leader elector kept running.
c@c:~$ kubectl get leases -n kube-system | grep elastic-*
elastic-agent-cluster-leader           elastic-agent-leader-elastic-agent-standalone-kmx9m                         14m

As we can see, the original leader re-acquired the lease.

@constanca-m constanca-m added the Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team label Apr 8, 2024
@constanca-m constanca-m requested a review from a team April 8, 2024 13:32
@constanca-m constanca-m self-assigned this Apr 8, 2024
@constanca-m constanca-m requested review from gizas and tetianakravchenko and removed request for a team April 8, 2024 13:32
@constanca-m constanca-m requested a review from a team as a code owner April 8, 2024 13:32
@constanca-m constanca-m requested review from blakerouse and pchila April 8, 2024 13:32

mergify bot commented Apr 8, 2024

This pull request does not have a backport label. Could you fix it @constanca-m? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v\d.\d.\d is the label to automatically backport to the 8.\d branch. \d is the digit

NOTE: backport-skip has been added to this pull request.

@mergify mergify bot added the backport-skip label Apr 8, 2024
constanca-m and others added 2 commits April 8, 2024 16:36
  Co-authored-by: Blake Rouse <blake.rouse@elastic.co>
  …ernetes_leaderelection.go (Co-authored-by: Blake Rouse <blake.rouse@elastic.co>)
@cmacknz (Member) left a comment:

Requesting changes until we get an answer on if this needs rate limiting and the impacts on the k8s control plane traffic.

The current implementation definitely needs to change, but it's not clear to me yet that this change won't just make calls to acquire leases as fast as possible.

return comm.Err()

for {
le.Run(ctx)
@cmacknz (Member) commented:

If there is a cluster error resulting in the leader continuously losing the lease, will this result in attempts to acquire it as quickly as possible with no rate limit?

The implementation of Run I see is:

// Run starts the leader election loop. Run will not return
// before leader election loop is stopped by ctx or it has
// stopped holding the leader lease
func (le *LeaderElector) Run(ctx context.Context) {
	defer runtime.HandleCrash()
	defer func() {
		le.config.Callbacks.OnStoppedLeading()
	}()

	if !le.acquire(ctx) {
		return // ctx signalled done
	}
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()
	go le.config.Callbacks.OnStartedLeading(ctx)
	le.renew(ctx)
}

There have been several escalations showing that failing to appropriately rate limit k8s control plane API calls like leader election can destabilize clusters.

What testing have we done to ensure this change won't cause issues like this?

@constanca-m (Contributor, Author) replied Apr 8, 2024:

We have already been doing this. This change will not affect the number of calls; it just makes sure that at least 1 agent will be reporting metrics. The problem with the current implementation is that Run goes like this:

  • Keeps trying to acquire the lease
  • Acquires the lease
  • Loses the lease and stops running

All the while, all the other instances were already trying to acquire the lease.

We do not have many SDHs on this bug, which leads me to believe that it is rare for an agent to lose the lease. But we do have problems when agents stop reporting metrics, and the only way we knew how to restart that was to make the pod run again - that is, force Run() to run again.
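For reference, the upstream acquire loop, abridged here from client-go's leaderelection package (details may vary by version), is already paced by RetryPeriod, which is why restarting Run does not change how often the control plane is called:

// Abridged/paraphrased from k8s.io/client-go/tools/leaderelection.
func (le *LeaderElector) acquire(ctx context.Context) bool {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()
	succeeded := false
	// Each attempt to take the lease runs at most once per RetryPeriod
	// (with jitter) until it succeeds or ctx is cancelled.
	wait.JitterUntil(func() {
		succeeded = le.tryAcquireOrRenew(ctx)
		if succeeded {
			cancel()
		}
	}, le.config.RetryPeriod, JitterFactor, true, ctx.Done())
	return succeeded
}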

@constanca-m (Contributor, Author) added:

There are some parameters in the config regarding the lease, in case it becomes necessary to reduce how often an agent tries to acquire it:

LeaseDuration int `config:"leader_leaseduration"`
RenewDeadline int `config:"leader_renewdeadline"`
RetryPeriod   int `config:"leader_retryperiod"`
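For context, these integer-second values are roughly what ends up in client-go's LeaderElectionConfig as durations. A hedged sketch of the mapping, where Config, the lock, and the callbacks are placeholders for the provider's actual wiring:

import (
	"context"
	"time"

	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// buildLeaderElector is a hypothetical helper showing how the integer-second
// settings above would map onto client-go durations.
func buildLeaderElector(cfg Config, lock resourcelock.Interface) (*leaderelection.LeaderElector, error) {
	return leaderelection.NewLeaderElector(leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: time.Duration(cfg.LeaseDuration) * time.Second,
		RenewDeadline: time.Duration(cfg.RenewDeadline) * time.Second,
		RetryPeriod:   time.Duration(cfg.RetryPeriod) * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) { /* mark this agent as leader */ },
			OnStoppedLeading: func() { /* mark this agent as non-leader */ },
		},
	})
}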

@cmacknz (Member) commented Apr 8, 2024:

Before I remove my requested changes, do we have a way to test this in this repository?

It looks like you'd need a test that you can get the lease again after you've lost it.

I see it is possible to inject the client used in the lock implementation, which takes an interface.

I also see a fake client is used in the secrets tests.

Is it possible to mock out the lease client to test this?
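For example, a minimal sketch with the fake clientset (the wiring here is illustrative, not the actual test code):

import (
	"k8s.io/client-go/kubernetes"
	k8sfake "k8s.io/client-go/kubernetes/fake"
)

// The fake clientset satisfies kubernetes.Interface, so it can be injected
// wherever the real client is used, including the Lease lock.
var client kubernetes.Interface = k8sfake.NewSimpleClientset()

// Leases created or updated by the leader elector can then be inspected with
// client.CoordinationV1().Leases("kube-system").Get(...) inside the test.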

@constanca-m (Contributor, Author) commented Apr 8, 2024:

> Before I remove my requested changes, do we have a way to test this in this repository?

For the beats PR, I did add a unit test to check that. However, we had a way to change the start leading and stop leading functions there, and in this repository I cannot get inside those functions to check if that is working. Because of that, I don't think it is possible to test.

There is a way to force the lease to be updated, and sometimes the leader changes. The problem is that we are working with seconds in the lease duration fields, so if I push a test for that, it might take about 2 minutes; is that a good idea? I cannot go below seconds for those fields, unlike in the beats PR, because these fields are integers in seconds, and changing them to time.Duration would be a breaking change. @cmacknz

@cmacknz (Member) commented Apr 8, 2024:

> For the beats PR, I did add a unit test to check that. However, we had a way to change the start leading and stop leading functions there, and in this repository I cannot get inside those functions to check if that is working. Because of that, I don't think it is possible to test.

Can't you allow access to the lease callbacks you need through an internal constructor, so that the ContextProviderBuilder used externally remains unmodified?

> There is a way to force the lease to be updated, and sometimes the leader changes. The problem is that we are working with seconds in the lease duration fields, so if I push a test for that, it might take about 2 minutes; is that a good idea? I cannot go below seconds for those fields, unlike in the beats PR, because these fields are integers in seconds, and changing them to time.Duration would be a breaking change. @cmacknz

As above you could bypass the Config object in ContextProviderBuilder(logger *logger.Logger, c *config.Config, managed bool) using an internal constructor for tests that gives you access to the underlying time.Duration fields.

It seems possible to test this here, although it does look like it'll be more work to write than it was in Beats.

The upstream tests are also an interesting source of inspiration https://github.com/kubernetes/client-go/blob/master/tools/leaderelection/leaderelection_test.go, but they test at a lower level and use a lot more test machinery from their own packages.

@blakerouse (Contributor) left a comment:

You made the requested changes I asked for and those changes look good.

I agree with Craig on all points about adding unit tests to cover this code.

@constanca-m (Contributor, Author) commented:

I added a unit test @blakerouse @cmacknz. The test does the following:

// TestNewLeaderElectionManager will test the leader elector.
// We will try to check if an instance can acquire the lease more than once. This way, we will know that
// the leader elector starts running again after it has stopped - which happens once a leader loses the lease.
// To make sure that happens, we will do the following:
// 1. We will create the lease to be used by the leader elector.
// 2. We will create two context providers - in the default context, this would mean two nodes, each one with an agent running.
// We will wait for one of the agents, agent1, to acquire the lease, before starting the other.
// 3. We will force the lease to be acquired by the other agent, agent2.
// 4. We will force the lease to be acquired by agent1 again. To avoid agent2 re-acquiring it multiple times,
// we will stop its provider and make sure agent1 can re-acquire the lease.


select {
case <-done:
case <-time.After(time.Duration(leaseDuration+leaseRetryPeriod) * 20 * time.Second):
@constanca-m (Contributor, Author) commented:

Technically, this timeout seems possible to hit...

There are only two candidates to acquire the lease, agent1 and agent2. In this case, we need to wait for agent2 to acquire it. Since I am forcing a new leader, agent1 should lose the lease, which also means its leader elector needs to start again. This should be enough time for agent2 to acquire it.

However, I still added the timeout to make sure this test does not run for very, very long. Should I remove it?

@constanca-m (Contributor, Author) followed up:

Windows unit tests hit the timeout, so I removed it. This unit test might take a while to run.

@constanca-m (Contributor, Author) commented:

/test


getK8sClientFunc = func(kubeconfig string, opt autodiscoverK8s.KubeClientOptions) (kubernetes.Interface, error) {
return client, nil
}
A reviewer (Contributor) commented:

What would be nice would be setting the getK8sClientFunc back to the original after the test. Something like:

defer func() {
   getK8sClientFunc = defaultGetK8sClientFunc
}()
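An equivalent pattern is t.Cleanup, which keeps the save and restore next to the override (sketch, reusing the names from the snippet above):

original := getK8sClientFunc
getK8sClientFunc = func(kubeconfig string, opt autodiscoverK8s.KubeClientOptions) (kubernetes.Interface, error) {
	return client, nil
}
// Restore the default client factory when the test and its subtests finish.
t.Cleanup(func() { getK8sClientFunc = original })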


// We need to wait for the first agent to acquire the lease, so we can set the POD_NAME environment variable again
expectedLeader := leaderElectorPrefix + podNames[i]
for {
A reviewer (Contributor) commented:

What happens here if holder == expectedLeader never becomes true? Seems like this will loop forever?

Might be better to switch to require.Eventually with a timeout, just to ensure that a break in the code somewhere doesn't have this block the unit test forever.
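A hedged sketch of what that could look like, assuming the fake client, lease name/namespace, and expectedLeader from the test (timeout values are illustrative):

require.Eventually(t, func() bool {
	// Read the current lease holder from the (fake) API and compare it
	// with the leader we expect.
	lease, err := client.CoordinationV1().Leases(namespace).Get(ctx, leaseName, metav1.GetOptions{})
	if err != nil || lease.Spec.HolderIdentity == nil {
		return false
	}
	return *lease.Spec.HolderIdentity == expectedLeader
}, 2*time.Minute, time.Second, "expected %s to become the lease holder", expectedLeader)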

@constanca-m (Contributor, Author) replied:

I added a timeout. I explained in this comment why I removed it before #4542 (comment).

require.NoError(t, err)

// In this case, we already have an agent as holder
if currentHolder == leaderElectorPrefix+podNames[0] || currentHolder == leaderElectorPrefix+podNames[1] {
@constanca-m (Contributor, Author) commented:

I am not adding a timeout to this for loop because this is sure to happen, since there are only 2 pods running that can acquire the lease.

@cmacknz (Member) left a comment:

Thanks for the test!

constanca-m and others added 2 commits April 18, 2024 08:50
  Co-authored-by: Craig MacKenzie <craig.mackenzie@elastic.co>
  …ernetes_leaderelection_test.go (Co-authored-by: Craig MacKenzie <craig.mackenzie@elastic.co>)
@cmacknz cmacknz added the backport-v8.14.0 Automated backport with mergify label Apr 18, 2024
@cmacknz (Member) commented Apr 18, 2024:

Added the backport-v8.14.0 label as this is a bug fix. Feel free to remove it if you do not want this in 8.14.0.

@mergify mergify bot removed the backport-skip label Apr 18, 2024
@blakerouse (Contributor) left a comment:

Looks good.

@constanca-m constanca-m merged commit d48ac97 into elastic:main Apr 22, 2024
11 checks passed
@constanca-m constanca-m deleted the k8s-leader-election branch April 22, 2024 10:11
mergify bot pushed a commit that referenced this pull request Apr 22, 2024
* run leader elector at all times

---------

Co-authored-by: Blake Rouse <blake.rouse@elastic.co>
Co-authored-by: Craig MacKenzie <craig.mackenzie@elastic.co>
(cherry picked from commit d48ac97)
constanca-m added a commit that referenced this pull request Apr 22, 2024
…4604)

* run leader elector at all times

---------

Co-authored-by: Blake Rouse <blake.rouse@elastic.co>
Co-authored-by: Craig MacKenzie <craig.mackenzie@elastic.co>
(cherry picked from commit d48ac97)

Co-authored-by: Constança Manteigas <113898685+constanca-m@users.noreply.github.com>
Labels: backport-v8.14.0 (Automated backport with mergify), Team:Cloudnative-Monitoring (Label for the Cloud Native Monitoring team)
3 participants